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Abstract — Recently there is a line of research work proposing 
to employ Spectral Clustering (SC) to segment (group) 1 high- 
dimensional structural data such as those (approximately) lying 
on subspaces 2 or low-dimensional manifolds. By learning the 
affinity matrix in the form of sparse reconstruction, techniques 
proposed in this vein often considerably boost the performance 
in subspace settings where traditional SC can fail. Despite 
the success, there are fundamental problems that have been 
left unsolved: the spectrum property of the learned affinity 
matrix cannot be gauged in advance, and there is often 
one ugly symmetrization step that post-processes the affinity 
for SC input. Hence we advocate to enforce the symmetric 
positive semidefinite constraint explicitly during learning (Low- 
Rank Representation with Positive SemiDefinite constraint, or 
LRR-PSD), and show that factually it can be solved in an 
exquisite scheme efficiently instead of general-purpose SDP 
solvers that usually scale up poorly. We provide rigorous 
mathematical derivations to show that, in its canonical form, 
LRR-PSD is equivalent to the recently proposed Low-Rank 
Representation (LRR) scheme [1], and hence offer theoretic 
and practical insights to both LRR-PSD and LRR, inviting 
future research. As per the computational cost, our proposal 
is at most comparable to that of LRR, if not less. We validate 
our theoretic analysis and optimization scheme by experiments 
on both synthetic and real data sets. 

Keywords -spectral clustering, affinity matrix learning, rank 
minimization, robust estimation, eigenvalue thresholding 

I. Introduction 

This paper deals with grouping or segmentation of high- 
dimensional data under subspace settings. The problem is 
formally defined as follows 

Problem 1 (Subspace Segmentation). Given a set of suf- 
ficiently dense data vectors X = [xx,--- ,x„], £ R d 
representing a sample \fi = 1, • • ■ , n. Suppose the data are 
drawn from a union of k subspaces {Sj} i=1 of unknown 
dimensions {di} i=1 respectively, segment all data vectors 
into their respective subspaces. 

In this regard, the vast number of available clustering 
algorithms, ranging from the most basic k-means method to 
the most elegant and sophisticated spectral clustering (SC) 
method, can all be used towards a solution. Nevertheless, 

1 Throughout the paper, we use segmentation, clustering, and grouping, 
and their verb forms, interchangeably. 

2 We follow [1] and use the term "subspace" to denote both linear 
subspaces and affine subspaces. There is a trivial conversion between linear 
subspaces and affine subspaces as mentioned therein. 



there are strong reasons to believe that exploiting the very 
regularity associated with the data can enhance the clustering 
performance. 

We choose SC as the basic framework for subspace 
segmentation. SC has been extensively researched (see [2] 
for a recent review) and employed for many applications 
(e.g. image segmentation [3] in computer vision). SC has 
the remarkable capacity to deal with highly complicated data 
structures which may easily fail simple clustering methods 
such as k-means. The excellent performance of SC can be 
partially explained via its connection to the kernel method 
which has been extensively studied in machine learning, 
specially the recent unification of weighted kernel k-means 
and SC [4]. The implicit data transformation into higher- 
dimensional spaces is likely to make the clustering task easy 
for basic clustering algorithms. 

Analogous to the freedom to choose the kernel function 
in kernel methods, SC is flexible enough to admit any 
similarity measures in the form of affinity matrices as its 
input. Despite the existence of research work on SC with 
general affinity matrices that are not positive semidefinite 
(see e.g., [5]), in practice the Gaussian kernel s(x,,Xj) = 
exp (— ||xj — Xj || 2 /u 2 ) and the linear kernel s(Xj,Xj) = 
x T x 3 are evidently the most commonly employed. Use of 
these kernels naturally ensures about symmetry and positive 
semidefiniteness of the affinity matrix. When there are 
processing steps that cause asymmetry, e.g., construction of 
nearest-neighbors based affinity matrix, there is normally an 
additional symmetrization step involved before the subse- 
quent eigen-analysis on the resultant Laplacian matrix in 
normal SC routines [2]. 

A. Prior Work on Subspace Segmentation 

The most intuitive way to solve the subspace segmentation 
problem is perhaps by robust model fitting. In this aspect, 
classic robust estimation methods such as RANSAC [6] and 
Expectation Minimization [7] can be employed, based on 
some assumptions about the data distribution, and possibly 
also the parametric form (e.g., mixture of Gaussians). 

Most customized algorithms on this problem, however, are 
contributed from researchers in computer vision, to solve the 
3D multibody motion segmentation (MMS) problem (see 
e.g., [8] for the problem statement and review of existing 
algorithms). In this problem, geometric argument shows that 



trajectories of same rigid-body motion lie on a subspace of 
dimension at most 4. Hence MMS serves as a typical ap- 
plication of subspace segmentation. There are factorization 
based methods [9], algebraic methods exemplified by the 
Generalized Principal Component Analysis (GPCA) [10], 
and local subspace affinity (LSA) [11] to address MMS. 
They are all directly or indirectly linked to SC methods, 
and can be considered as different ways to construct the 
affinity matrix for subsequent SC (for the former is similar 
to the linear kernel 3 , and the latter two kernels defined with 
local subspace distances). 

Of special interest to the current investigation is the 
recent line of work on constructing the affinity matrix by 
sparsity-induced optimization. Cheng et al [13] (l\ graph ) 
and Elhamifar et al [14] (sparse subspace clustering, SSC) 
have independently proposed to use sparse reconstruction 
coefficients as similarity measures. To obtain the sparse 
coefficients, they reconstruct one sample using all the rest 
samples, while regularizing the coefficient vector by l\ norm 
to promote sparsity. Hence the problem to solve boils down 
to the Lasso (i.e., l\ -regularized least square problem, [15]), 
which has been well studied on theoretic and computational 
sides (ref e.g., [16]). Most recently Liu et al [1] has proposed 
to compute the reconstruction collectively, and regularize 
the rank of the affinity matrix for capturing global data 
structures. This is made possible by employing the nuclear 
norm minimization as a surrogate, and they also provide a 
robust version to resist noise and outliers. 

Nuclear norm minimization as a surrogate for rank min- 
imization is a natural generalization of the trace heuristic 
used for positive semidefinite matrices in several fields, such 
as control theory [17]. The need for rank minimization has 
theoretically stemmed from the exploding research efforts in 
compressed sensing sparkled by the seminal paper [18]. In 
fact, generalizing from vector sparsity to spectrum sparsity 
for matrices is natural. The practical driving forces come 
from applications such as collaborative filtering, sensor 
localization, to just name a few [19]. From the computational 
side, there are several cutting-edge customized algorithms 
for solving the otherwise large-scale SDP problem that is 
likely to plague most popular off-the-shelf SDP solvers. 
These algorithms include prominently singular value thresh- 
olding [20], accelerated proximal gradient (APG) [21], and 
augmented Lagrange multiplier (ALM) methods [22] (see 
[22] for a brief review). 

B. Our Investigation and Contributions 

We advocate to learn an affinity matrix that makes a 
valid kernel directly, i.e., being symmetric positive semidef- 
inite. This is one critical problem the previous sparse- 

3 Wei and Lin [12] have concurrently got similar results as produced in 
Sec.II-B of the current paper, and also they have identified the closed-form 
solution of LRR with that of the shape interaction matrix (SIM) in the 
classic factorization method. 



reconstruction and global low-rank minimization approaches 
has bypassed. Without consideration in this aspect, the 
empirical behaviors of the learnt affinity matrices are poorly 
justified. We will focus on the global framework proposed 
in [1] as the global conditions are easier to gauge and 
additional constraints on the learnt affinity matrix can be 
put in directly. 

We will constrain the affinity matrix to be symmetric 
positive semidefinite directly in LRR-PSD. Surprisingly, 
during analysis of connection with the canonical form of 
LRR proposed in [1], we find out the two formulations are 
exactly equivalent, and moreover we can accurately char- 
acterize the spectrum of the optimal solution. In addition, 
we successfully establish the uniqueness of the solution to 
both LRR and LRR-PSD, and hence correct one critical 
error reported in [1] stating that the optimal solutions are 
not unique. 

More interestingly, we show that our advocated formu- 
lation (LRR-PSD) in its robust form also admits a simple 
solution like that of LRR as reported in [1], but at a 
lower computational cost. As a nontrivial byproduct, we 
also provide a rigorous but elementary proof to nuclear- 
norm regularized simple least square problem with a posi- 
tive semidefinite constraint, which complements the elegant 
closed-form solution to the general form [20]. 

To sum up, we highlight our contributions in two aspects: 
1) we provide a rigorous proof of the equivalence between 
LRR and LRR-PSD, and establish the uniqueness of the 
optimal solution. In addition, we offer a sensible character- 
ization of the spectrum for the optimal solution; and 2) we 
show that Robust LRR-PSD can also be efficiently solved in 
a scheme similar to that of LRR but with notable difference 
at a possibly lower cost. 

II. Robust Low-Rank Subspace Segmentation 
with Semidefinite Guarantees 

We will first set forth the notation used throughout the 
paper, and introduce necessary analytic tools. The canonical 
optimization framework of learning a low-rank affinity ma- 
trix for subspace segmentation/clustering will be presented 
next, and the equivalence between LRR-PSD and LRR will 
be formally established. Taking on the analysis, we briefly 
discuss about the spectrum in the robust versions of LRR- 
PSD and LRR, and touches on other noise assumptions. We 
will then proceed to present the optimization algorithm to 
tackle LRR-PSD under noisy settings (i.e., Robust LRR- 
PSD). 

A. Notation and Preliminaries 

1) Summary of Notations: We will use bold capital and 
bold lowercase for matrices and vectors respectively, such 
as X and b, and use normal letters for scalars and entries 
of matrices and vectors, e.g., A, Xij (the (i,j) th entry of 
matrix X). We will consider real vector and matrix spaces 



exclusively in this paper and use M. n or R mx ™ alike to denote 
the real spaces of proper dimensionality or geometry. 

We are interested in five norms of a matrix X <E E mx ™. 
The first three are functions of singular values {e^} and they 
are: 1) the operator norm or induced 2-norm denoted by 
||X|| 2 , which is essentially the largest singular value cr max ; 

2) the Frobenius norm, defined as ||X||j? = [J2i=i °i) > 
and 3) the nuclear norm, or sum of the singular values 
IIXH* = Y2i=i a i< assuming d = min (to, n). The additional 
two include: 4) the matrix i\ norm ||X||i which generalizes 
the vector l\ norm to the concatenation of matrix columns; 
and 5) the group norm ||X||2,i, which sums up the £2 
norms of columns. Besides, the Euclidean inner product 
between matrices is (X, Y) = trace (X T Y). This also 
induces an alternative calculation of the Frobenius norm, 
||X|| F = v/trace (X T X). 

We will denote the spectrum (the set of eigenvalues) of 
a square matrix by A(N), for N £ R nxn (similarly the 
collection of singular values for a general matrix er (X)). 
We denote the set of all n x n real symmetric matrix by 
S n , and the corresponding positive semidefinite cone as 6>™ 
and Ne<S" <^ A (N) > and N £ S n , in which N 
is said to be positive semidefinite and simply designated as 
N y 0. We reiterate the requirement on symmetry here since 
conventionally definiteness is not defined for asymmetric 
matrices. 

In addition, we all use diag (X) and Diag (b) to mean 
taking the diagonal vector of a matrix and reshape a vector 
into a diagonal matrix, respectively. Other notations such as 
trace (X), rank(X) manifest themselves literally. 

2) Nuclear Norm Minimization and Rank Minimization: 
We choose to devote this subsection to reviewing more 
concrete properties about the nuclear norm due to its sig- 
nificance to this paper in particular and to the whole range 
of work on low-rank minimization in general. 

Definition 2 (Unitarily Invariant Norms). A matrix norm || • || 
is unitarily invariant if ||X| = ||UXV|| for all matrices X 
and all unitary matrices U and V (i.e. U™ 1 = U T , V -1 = 
V T ) of compatible dimension. 

Interestingly common encountered unitarily invariant 
norms are all functions of the singular values, and lie 
within two general families: 1) Schatten-p norms, arising 
from applying the p-norm to the vector of singular values, 

||X||s p = ofj ; and 2)Ky-Fan k norms, rep- 

resenting partial ordered sums of largest singular values, 
||X||ifirfc = X)»=i assuming k < d and o\ > ■ ■ ■ > a d 
all for d = min (to, n). Of our interest here is that the 
nuclear norm and Frobenius norm are Schatten-1 norm and 
Schatten-2 norms, respectively. This fact will be critical for 
several places of our later argument. 

Next we will state one crucial fact about the duality 
between the nuclear norm and the operator norm. For any 



norm || • ||, its dual norm || • \f is defined via the variational 
characterization [23] 

||X|| D = sup{(Y,X)|||Y||<l}, (1) 

Y 

where ||Y|| < 1 can always be taken as equality for the 
supremum to achieve, since the inner product (Y,X) is 
homogeneous w.r.t. Y. Then we have a formal statement 
about the duality 

Lemma 3 ( [24], Proposition 2.1). The dual norm of the 
operator norm \\ ■ H2 in ]R mx ™ is the nuclear norm || • ||*. 

In fact, the duality taken together with the characterization 
of dual norms implies (Y,X) < ||Y||2||X||„, which has 
been used extensively in the analysis of nuclear norm 
problems. 

Our last piece of review touches on the core of the 
problem, i.e., how rank minimization problems (NP-Hard) 
could be (conditionally) solved via nuclear norm minimiza- 
tion formulation which is convex. This myth lies with the 
concept of convex envelope, which means the tightest convex 
pointwise approximation to a function (tightest convex lower 
bound). Formally, for any (possibly nonconvex, e.g., the rank 
function currently under investigation) function / : C h-> K, 
where C denotes a given convex set, the convex envelope of 
/ is the largest convex function g such that g (x) < f (x) 
for all x £ C [17]. The following lemma relates the rank 
function to the nuclear norm via convex envelope 

Lemma 4 ( [17], Theorem 1, pp.54 and Sec. 5. 1.5 for 
proof). The convex envelope of rank (X) on the set 
{X £ R mxn I ||X|| 2 < 1} is the nuclear norm ||X||,. 

This lemma justifies the heuristic to use the nuclear norm 
as a surrogate for the rank. Much of recent work, e.g. low- 
rank matrix completion [19] and Robust Principal Compo- 
nent Analysis (RPCA) [25], proves theoretically under mild 
conditions, the optimization can be exactly equivalent. We 
will borrow heavily the idea of this surrogate, and build on 
the theoretical underpinnings to develop our formulation and 
analysis of LRR-PSD/LRR for subspace segmentation. 

B. Subspace Segmentation with Clean Data - An Amazing 
Equivalence 

To tackle the subspace segmentation problem, Liu et al [1] 
have proposed to learn the affinity matrix for SC via solving 
the following rank minimization problem 

(LRANK) min . rank (Z) , s.t. X = XZ. (2) 

As a convex surrogate, the rank objective is replaced by the 
nuclear norm, and hence the formulation 

(LRR) min. ||Z||*, s.t. X = XZ. (3) 

Instead, we advocate to solve the problem incorporating the 
positive semidefinite constraint directly to produce a valid 



kernel directly as argued above 

(LRR-PSD) min. ||Z||», s.t. X = XZ, Z ^ 0. (4) 

Liu et al [1] has established one important characterization 
about solution(s) to LRR for X £ R dxn and Z £ R nxn , 
provided the data have been arranged by their respective 
groups, i.e., the true segmentation. 

Theorem 5 ( [1], Theorem 3.1). Assume the data sampling 
is sufficient such that > rank(Xi) = d% and the 
data have been ordered by group. If the subspaces are 
independent then there exists an optimal solution Z* to 
problem LRR that is block-diagonal: 












z* 


















Zt 



(5) 



with Z* being an rii x rii matrix with rank(Z*) = d, Vi 

This observation is critical to good segmentation since 
affinity matrices with block diagonal structure (for sorted 
data as stated) favor perfect segmentation as revealed by 
theoretic analysis of SC algorithms (e.g., refer to [2] for 
brief exposition). There are, however, discoveries that are 
equally important yet to make. We will next state somewhat 
surprising results that we have derived, complementing 
Theorem 5 and providing critical insights in characterizing 
the (identical and unique) solution to LRR-PSD and LRR. 

Theorem 6. Optimization problem LRR has a unique 
minimizer Z*. Moreover there exists an orthogonal matrix 
Q e R nxn such that 



l r 




(6) 



Q T Z*Q 

where r — rank (X), and obviously Z* y 0. 

Three important corollaries follow immediately from The- 
orem 6. 

Corollary 7 (LRR-PSD/LRR Equivalence). The LRR prob- 
lem and LRR-PSD problem are exactly equivalent, i.e. with 
identical unique minimizers that are symmetric positive 
semidefinite. 

Proof: Z* in Theorem 6 naturally obeys LRR-PSD. ■ 

Corollary 8. Assume the setting in Theorem 5. The optimal 
solution Z* to problem LRR and LRR-PSD are block- 
diagonal as in Eq. (5). 

Proof: Follow directly from LRR-PSD/LRR equiva- 
lence and Theorem 5. ■ 

Corollary 9 (LRR-PSD/LRR/LRANK Equivalence). The 
optimal rank of Z in LRANK is the objective value obtained 
from LRR-PSD/LRR, i.e., rank(X). 



Proof: Proof of Theorem 6 (later) shows rank (Z) can- 
not be lower than rank (X) due to the constraint X = XZ. 
rank (X) is the optimal objective value for nuclear norm of 
Z since ||Z*||» = ||Q T Z*Q||* =rank(X). ■ 
The development of the results in Theorem 6 will testify 
the beautiful interplay between classic matrix computation 
and properties of nuclear norms we reviewed above. We 
present and validate several critical technical results preced- 
ing formal presentation of our proof. 

Lemma 10 ( [26], Lemma 7.1.2 on Real Matrices). If A £ 

l" x ", B £ R pxp , and M 6 R nxp satisfy 



AM = MB, rank (M) = p, 
then there exists an orthogonal Q £ R nxn such that 

Q T AQ = T 



Tn Tj.2 
T 22 



(7) 



(8) 



for T £ R nxn , Tn G R pxp and T 12 , and T 22 of compat- 
ible dimensions. Furthermore, A (Tn) = A (A) fl A (B). 4 

Since the proof is critical for subsequent arguments, we 
reproduce the sketch here for completeness. 
Proof: Let 



M = Q 



Ri 
o 



,Ri e 



vpxp 



(9) 



be a QR factorization 5 of M. By substituting this into Eq. (7) 
and rearranging we arrive at 



Tn 


Ti 2 ~ 




Ri 




Ri 


T 21 


T 22 _ 













B, where Q AQ = 



Tn T12 
T 2 i T 22 
(10) 

with Tn G R pxp and others of compatible dimension. 
Since Ri is nonsingular, T21R1 = implying T 2 i = 0. 
Moreover, TuRi = RiB Tn = RiBR^ 1 , suggesting 
Tn and B are similar and hence A (Tn) = A (B). Lemma 
7.1.1 [26] dictates that A (A) = A(T) = A (Tn)UA (T 22 ), 
which leads to the conclusion. ■ 
The next lemma deals with an important inequality 
of nuclear norms on vertically-partitioned or horizontally- 
partitioned matrices. 

Lemma 11 ( [27], Adaptation of Theorem 4.4, pp 33-34). 

Let A £ R mxn be partitioned in the form 



A = 



A 1 
A 2 



(resp. A = [Ai A 2 ] ; 



(ID 



and the sorted singular values of A be o\ > cr 2 > • • ■ > 
ad > and those of Ai be n > t 2 > • • ■ > for 

4 We follow the convention in Golub and van Loan [26] and use A (•) to 
denote the set of eigenvalues counting multiplicity. Hence it is not a normal 
set, and use of set operators here abuses their traditional definitions. 

5 Note that QR factorization may not be unique, complement to the 
freedom in choosing a basis for Null (M T ), which is dual to the column 
space of M. 



d = min (m, n). Then ||A||* > 
holds if and only if A 2 = 0. 



|Ai||*, where the equality 



Proof: The original proof in [27] has shown that crj > 
Ti, Vz = 1, • • • , d. This is because of and rf , Vi = 1, • • • , d 
are eigenvalues of A T A (resp. AA T ) and A^Ai (resp. 
AiA T ), respectively, and A T A = A T Ai + Aj A 2 (resp. 
AA T = A x Aj + A 2 Aj). Theorem 3.14 [27] dictates that 
of > rf + A^, Vi = 1, • • • , d, where A^ is the smallest 
eigenvalue of Aj A 2 (resp. A 2 A 2 T ). It follows Yli=i °t — 
r i' ar, d hence we have ||A||» > ||Ai||*. 
We next show stronger results saying that the inequality 
is strict unless A2 = 0. 



(= 
II Ax|| 

Vi = 
trace 



> Ti, Vi 



Ed 
i- 



= 1, ■ ■ ■ ,d, requiring ||A||* = 
Ti amounts to imposing cr* = Ti, 



2 

(ATA) 
(AAT) 

: l|Ai||5 
2 

; — 



or 



we 



0, 



•) Since o 

Ed 

1, • • • , d, which suggests YTi=i °f = E"=x r i ? 
A T A) = trace (A T Ai) (resp. trace (AA T ) 
trace (AiA T )), identically ||A||| = ||Ai|| 2 P . But 
also have from the above argument trace 
trace (A T Ai) + trace (AJA2) (resp. trace 
trace (AiA t ) + trace (A 2 A t )), or ||A||J, = 
HA2III'. Taking them together we obtain || A2 
implying A2 = 0. 

(<=) Simple substitution verifies the equality and also 
completes the proof. ■ 
Based on the above two important lemmas, we derived 
our main results as follows. 

Proof: (of Theorem 6) We first show the claim about 
the semidefiniteness of Z*, and then proceed to prove the 
uniqueness. 

(Semidefiniteness of Z*) By XZ = X, we have 
Z T X T = X T . Taking r independent columns from X T 
(i.e., r independent rows from X) and organize them into a 
submatrix M of X T , we obtain Z T M = MI. By Lemma 10 
and its proof, we have a QR factorization of M and one 
similarity transform of Z T as, respectively 

R 




M = [U U- 1 ] 



and 



[u u^] T z T [ 



T11 




T12 

T22 



u 

I, 





Tia 
T22 



(12) 



where XJ 1 - spans the complementary dual subspace of U. 
Moreover, we have obtained Tn = I because proof of 
Lemma 10 suggests Tn = RIR -1 = I. The dimension 
is obviously determined by rank of X, i.e., r = rank(X). 

We continue to show that towards minimal ||Z ||*, T12 = 
T22 = 0. By the unitary invariance property of nuclear 



norm, mm . 



Noticing that 



T12 
T22 



= [u u x ] 1 Z T U^ 



and Z T M = M results in two constraints 

Z T UR = UR and Z T U ± = U ± = 0. 



(13) 



(14) 



Under these constraints, we can always make Z 7 !] 1 - = 0, 
or effectively T12 = T22 = 0, attaining the minimizer in 
that 

T12 I, 
T 22 " 



II, 



(15) 



where the first inequality has followed from Lemma 1 1 and 
equality is obtained only when T12 = T22 = 0. Hence 
we have shown that Z T = UU T = Z y 0, as an optimal 
solution of LRR. 

(Uniqueness of Z*) Suppose a perturbed version Z' = 
Z* + H is also a minimizer. So XZ' = X (Z* + H) = 
X = XZ*, suggesting XH = or H is in Null (X) (which 
complements the row space). We have 

R 



H T X T 







H U U J 











(16) 



H T UR 







H T U 



0. 



where the last equality holds because R is nonsingular. If 
H ^ 0, we have 

, T 



[u u- 1 -] z' T [u u- 1 

rir 

(U J 



11 


T' 

12 







T' 
J -22. 





U T H T U J 



, T 



H T U J 



(17) 



where we have substituted the analytic values of Z* and its 
corresponding Ty,Vi,j = {1,2} as discussed above and 
the fact H T U = 0. Since H 1 !! 1 ^ (otherwise together 
with H T U = we would obtain H = 0), employing the 
inequality in Lemma 11 again we see that 

> 







llrl 



l|z*|| 

(18) 

In other words, the objective is strictly increased unless 
H T = or H = 0, which establishes the uniqueness, and 
also concludes the proof. ■ 

Remark 12. Nonuniqueness of QR factorization of M will 
not affect the uniqueness of Z* as follows. Suppose we 
choose another basis V for column space of M, obviously 
V and U must be related by a within-space rotation 
R, i.e., V = UR. Hence w.r.t. the new basis we have 
Z* = VV T = URR T U = UU T , as R T R = RR T = I. 

C. Robust Subspace Segmentation with Data Containing 
Outliers and Noises 

To account for noises and outliers, explicit distortion 
terms can be introduced into the objective and constraint. 
Hence we obtain the robust version of LRR-PSD and LRR 
respectively as follows 

min. ||Z||* +A||E|| f , s.t. X = XZ + E, Z h 0, (19) 

min.||Z||* +A||E|k, s.t. XZ + E = X. (20) 

We have used || ■ \\? to mean generic norms. We caution 
that we cannot in general expect these two versions to be 



equivalent despite the provable equivalence of LRR-PSD 
and LRR. Remarkably, the problem has changed much due 
to the extra variable E. Nevertheless, it is still possible to 
partially gauge the behaviors of the solutions as follows. 

Suppose an optimal E* is somehow achieved (i.e., we 
assume it is fixed), we are then only concerned with 



the partial ALM problem, we have 



s.t. XZ + E* = X, (Z t 0) 



(21) 



Since columns of E* must be in the column space of X, 
we assume E* = X<5E. Then we obtain from the equality 
constraint X (Z + 5E) = X. By employing similar process 
in the proof of Theorem 6, one can easily verify that 



[U U- L ] T (Z T -f-<JE T ) [U U x ] 



-"-11 

c 



rpZ 

± 12 

rpZ 

A 22 



+ 



<5E 



± 11 



A 12 

ITK5E 1 

J -22 



(22) 



where the notation is consistent with the proof in Theo- 
rem 6. So towards minimizing ||Z T ||„, we can always have 
Z T \J ± = 0, or Tf 2 T = Tf 2 T = 0, for any SE T . So the 
rest of spectrum of Z T is determined by Tfj , and we have 
that Z T can have at most r nonvanishing eigenvalues, where 
r = rank(X). Note that + Tff T ^ has r eigenvalues 

of 1, so spectrum of Tf x will be perturbation of that since 
the norm of T^f is in general small. This is also confirmed 
by our numerical experiments in IV-B2. 

Moreover, we have intentionally left the norm for E 
unspecified since it apparently depends on the noise model 
we assume. The use of ||E|| 2i i assumes the noise is sample- 
specific. In practice, however, a more natural assumption 
is uniformly random, i.e., each dimension of every data 
sample has the same chance of getting corrupted. In this 
case, the simple ||E||i will suffice. We demonstrate via 
experiments IV-C, and show that indeed || • Id is more robust 
in that case. 

The above comments about spectrum properties and noise 
model selection apply to both settings. 

D. Solving Robust LRR-PSD via Eigenvalue Thresholding 

The equivalence of LRR-PSD and LRR does not readily 
translate to the respective robust versions, and hence we need 
to figure out ways of solving the robust LRR-PSD. Due to 
the strong connection between these two problems, however, 
we will still try to employ the Augmented Lagrange Multi- 
pier (ALM) method (see e.g., [22]) to tackle this as in [1]. 

We first convert the problem into its equivalent form as 



min ||J||* + A||E|| £ , s.t. X = XZ + E, Z = J, Z y 0, 

Z,E,J 

(23) 

where we have used ||E||^ to mean generic norms. Forming 



min 1 1 J I 

Z.E,J>-0,Yi,Y 2 



AHEI 



(24) 



+ (Y 1 ,X-XZ-E) + (Y 2 ,Z-J) 
+|||X-XZ-E||| + |||Z-J|||. 

We can then follow the inexact ALM routine [22] to update 
Z, E, J, Yi, Y2 alternately. While fixing others, how to 
update E depends on the norm || • ||^. There are a bunch 
of norms that facilitate closed-form solutions, such as the 
|| • H2.1 discussed in [1] and || • ||i (see e.g., [22]). How to 
update J (J y 0) is the major obstacle to clean up. To be 
specific, we will be facing problem of this form to update 



M* 



argmin — ||M| 

M fJ- 



1 



M-G|||, s.t. M y 0, (25) 



where G may or may not be symmetric. We will next 
show in Theorem 14 that symmetric G facilitates a closed- 
form solution, and generalize this in Theorem 16 which 
basically states that asymmetric G also leads to a closed- 
form solution. Moreover, the major computational cost lies 
with eigen-decomposition of a symmetric square matrix, as 
compared with singular value decomposition of a square 
matrix of the same size in solving the counterpart in robust 
LRR. 

Lemma 13 ( [23], Lemma 3.2). For any block partitioned 
[A Bl 

matrix X = ^ ^ , this inequality holds 



|X| 



C D 



> 



A 
D 



= IIA|| 



IDI 



(26) 



Similar inequality also holds for the square of Frobenius 
II- 



norm 



Theorem 14. For any symmetric matrix S £ S n , tlie unique 
closed form solution to the optimization problem 



M* = argmin -||M|| 
m A* 



1 



|M-S||!,MhO, (27) 



takes the form 

M* = Q Diag [max(A - 0)] Q T , (28) 

whereby S = QAQ T , for A = Diag (A), is the 
spectrum(eigen-) decomposition of S and max(-,-) should 
be understood element-wise. 6 . 

Proof: Observing that the objective is strictly convex 
over a convex set, we assert there exists a unique minimizer. 
The remaining task to single out the minimizer. Symmetric S 
admits a spectrum factorization S = QAQ T , where Q 1 = 

6 Toh and Yun [21] have shed some light on the results (ref. Remark 3 
in their paper) but lack a detailed development and theoretic proof, and 
our proof is derived independent of their work. Moreover, solution to the 
general case as stated in the next theorem extends this results. 



Q T . We set M = Q T MQ, and hence the optimization in 
Eq. (27) can be cast into 



M* 



argmin — ||M| 

M P 



M 



A\\ F , M^O. (29) 



By the unitary invariance property of the Frobenius norm and 
the nuclear norm, and the fact that M > ^ Q T MQ >z 
with unitary (orthogonal) Q, we assert these two optimiza- 
tion jrjroblems are exactly equivalent (in the sense that M 
and M can be recovered from each other deterministically). 

Next we argue that a minimizer M* must be a diagonal 
matrix. Let f(M) = l/At||M||„ + 1/2||M - A\\ 2 F . In fact, 
for a non-diagonal matrix JMo, we can always restrict it to 
diagonal elements to get M d such that f(M d ) < /(Mo) 
by Lemma. 13 and the fact A being diagonal. The strict 
inequality holds since restriction from a non-diagonal matrix 
to its diagonal elements results in strict decrease in square of 
the Frobenius norm. So assuming M = Diag - , £ n ) 
and A = Diag(Ai,--- , A n ), the problem reduces to a 
quadratic program w.r.t. {^i}™ =1 

-, n 1 n 

mti = arg min - ]T & + - £ 1 1 & - A, 1 1 2 , & > 0, Vi. 

(30) 

The programming is obviously separable and simple ma- 
nipulation suggests the unique closed form solution £* = 
max(Aj — 0), which concludes the proof. ■ 

Remark 15. Note that uniqueness of the solution may not 
be directly translated from Eq. (29) to (27) since one may 
argue Q is not unique in general. There are three causes 
to the ambiguity: 1) general sign reversal ambiguity of 
eigenvectors, 2) freedom with eigenvectors corresponding 
to the zero eigenvalues, and 3) freedom with eigenvectors 
corresponding to eigenvalues with multiplicity greater than 
1. Noticing that M* = 2~2l=i max(Ai — 0)q_iqj , 
the sign ambiguity and problems caused by zero-valued 
eigenvalues are readily removed in view of the form of 
the summand max(Ai — 0)qiq i r - For the last problem, 
assume one repeated eigenvalue Xi has one set of its eigen- 
vectors arranged column-wise in V = [vi, • • • , Vjt], which 
essentially spans a k-dimensional subspace ( and acts as the 
basis). So this part of contribution to M* can be written as 
max(Ai — 0) VV T . Realizing that generating a new 
set of eigenvectors via linear combination can be accounted 
for by a rotation to the original basis vectors, namely 
V = VRfexfe for R T R = I in that subspace, we have 
A,VV T = A,VR(VR) T = AjVV T . Hence the sum is not 
altered by any cause. 

In fact, building on Theorem 14, we can proceed to devise 
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Figure 1. Comparison of computation time for full SVD/eigen- 
decomposition. SVD and EIG are from Matlab built-in function (which 
essentially is wrapper for corresponding Lapack routines), and LANSVD, 
LANEIG from PROPACK. 

a more general result on any real square matrix as follows. 7 

Theorem 16. For any square matrix P £ M. nxn , the unique 
closed form solution to the optimization problem 



M* 



argmin — ||M| 
m A* 



M 



Pill 



F ' 



MhO, (31) 



takes the form 

M* = Q Diag [max( A - 1/m, 0)] Q T , (32) 

whereby P = QAQ T , for A = Diag (A), is the 
spectrum(eigen-) decomposition of P — (P + P T ) /2 and 
max (•, •) should be understood element-wise. 

Proof: See Appendix B. ■ 
The above two theorems (Theorem 14 and 16) have 
enabled a fast solution to updating J. Moreover, since 
they ensure the symmetry of output J irrespective of the 
symmetry of G, we can be assured the alternation optimiza- 
tion process converges to a solution of J that satisfies the 
constraint J ^ 0. 

III. Complexity Analysis and Scalability 

For solving the ALM problems corresponding to robust 
LRR-PSD and robust LRR, the main computational cost 
per iteration comes from either eigen-decomposition of a 
symmetric matrix or SVD of a square matrix of the same 
size. In numerical linear algebra [26], computing a stable 
SVD of matrix X £ M. nxn is to convert it to an symmetric 
eigen-decomposition problem on an augmented matrix 

"0 X T1 
X 



X 



Hence from SVD to eigen-decomposition of comparable 
size, we can expect a constant factor of speedup that 
depends on the matrix dimension. Figure 1 provides bench- 
mark results on computational times of SVD and eigen- 
decomposition on matrices of sizes ranging from very small 

'Moreover, using the similar arguments, plus Lemma 11, we are able to 
produce a nonconstructive proof to the well known results about singular 
value thresholding [20] without any use of subgradient. We will not pursue 
in this direction as it is out of the scope of this paper. 



up to 5000. Tested solvers include these provided in Matlab 
and these in PROPACK [28]. It is evident that for matrices of 
the same size, eigen-decomposition is significantly faster in 
both solver package. We will stick to the built-in functions 
in Matlab as we find in practice PROPACK is sometimes 
unstable when solving full problems (it is specialized in 
solving large and sparse matrices). 

IV. Experiments 

In this section, we systematically verify both the theoretic 
analysis provided before, and related claims. 

A. Experiment Setups 

We use two data sets throughout our experiments. 
Toy Data (TD). Following setting in [1], 5 independent 
subspaces {Si} i=1 C K 100 are constructed, whose bases 
{UjLi are generated by V i+1 = TU U 1 < i < 4, 
where T represents a random rotation and Ui a random 
orthogonal matrix of dimension 100 x 4. So each subspace 
has a dimension of 4. 20 data vectors are sampled from 
each subspace by X,; = U^Qi, 1 < i < 5 with Q; being a 
4 x 20 iid zero mean unit variance Gaussian matrix Af (0, 1). 
Collection of this clean X should have rank 20. 
Extended Yale B (EYB). Following setting in [1], 640 
frontal face images of 10 classes from the whole Yale B 
dataset are selected. Each class contains about 64 images, 
and images are resized to 42 x 48. Raw pixel values are 
stacked into data vectors of dimension 2016 as features. This 
dataset is an example of heavily corrupted data. 

B. Equivalence of LRR-PSD and LRR 

1) Spectrum Verification: Recall the key to establish the 
equivalence of LRR-PSD and LRR lies with showing that 
the eigenvalues and singular values of Z* are identical, with 
1 of multiplicity equal to the data rank and the rest 0's (Ref. 
Theorem 6 and the associated proof). In order to verify this, 
we use TD without introducing any noise, and hence the data 
matrix has rank 20. We simulate the clean settings, i.e., LRR- 
PSD and LRR by gradually increasing the regularization 
parameter A of the robust versions (19) and (20). Intuitively 
for large enough A, the optimization tends to put E = and 
hence approaches the clean settings. Figure 2 presents the 
results along the regularization path (0.1 ~ 1). It is evident 
during the passing to A = 1, the eigenvalue and singular 
value spectra match each other, and identically produce 20 
values of 1 and the rest all 0. This confirms empirically the 
correctness of our theoretic analysis. 

2) Spectrum Perturbation Under Robust Setting: As we 
conjectured in Sec. II-C, in most cases spectrum of the 
obtained affinity matrix from robust LRR-PSD or robust 
LRR will be perturbation of the ideal spectrum. Repeated 
experiments on many settings confirm about this, although 
we cannot offer a formal explanation to this yet. Here we 
only produce a visualization (Figure 3) to show how things 
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Figure 2. Comparison of the eigen-spectrum (top) and singular value 
spectrum (bottom) for clean toy data (no artificial noises added) under 
the robust settings. Increasing the value of A in the robust settings, or 
effectively passing towards the clean formulation, the optimal Z* tends to 
produce 20 nonvanishing eigenvalues/singular values of 1. Left: by solving 
LRR. Right: by solving LRR-PSD. (Please refer to the color pdf and zoom 
in for better viewing effect.) 



evolve under different noise level when we set A = 0.12. 
The noise is added in sample-specific sense, as done in [1], 
i.e., some samples are chosen to be corrupted while others 
are kept clean. We do observe some breakdown cases when 
A is very small (not presented in the figure), which can be 
partially explained by that in that case the effect of nuclear 
norm regularization is weakened. 




Eigenvalue Index 

Figure 3. Evolution of the eigen-spectrum of the learnt affinity matrix 
under different noise levels. Left: solving by robust LRR; Right: solving 
by robust LRR-PSD. Surprisingly the spectra are always confined within 
[0, 1] in this setting. (Please refer to the color pdf and zoom in for better 
viewing effect.) 



C. Selection of Noise Models 

We have argued that the norm selection for the noise 
term E should depend on the knowledge on noise patterns. 
We are going to compare the || • Id noise model with the 
|| • || 2.1 noise model used in [1]. First we test on TD. Instead 
of adopting a sample-specific noise assumption, we assume 
that the corruptions are totally random w.r.t. data dimension 
and data sample which is more realistic. We add Gaussian 
noise with zero mean and variance 0.3||X||f, where X is the 
whole data collection. Percentage of corruption is measured 
against the total number of entries in X. The evolution 
of SC performance against the percentage of corruption 
is presented in Figure 4. We can see the obvious better 
resistance against noises exhibited by the || • ||i form. 
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Figure 4. Comparison of performance using different noise models. 

In essence we use ||E||i and ||E||2,i respectively in the objective. Instead 
of sample-specific noise, we assume random distributed noises, which is a 
more natural noise model. The l\ version shows better resistance against 
noise. 

D. Performance Benchmark: LRR-PSD vs. LRR 

We benchmark for the speed of robust LRR-PSD and 
robust LRR on EYB, and also present the clustering per- 
formance as compared to the conventional Gaussian kernel 
and linear kernel SC, which is obviously missing from [1]. 
Table I presents the accuracy obtained via various affin- 

Table I 

Segmentation accuracy (%) on EYB. We record the average 

PERFORMANCE FROM MULTIPLE RUNS INSTEAD OF THE BEST, AND 
REDUCE THE DIMENSION TO 100 AND 50 RESPECTIVELY IN THE 
BOTTOM TWO ROWS. 





Gauss SC 


Linear SC 


SSC 


LRR 


LRR-PSD 


Acc. 


20.00 


30.16 


49.37 


59.53 


60.63 


Acc. (100D) 


22.66 


27.97 


49.38 


61.56 


60.00 


Acc. (50D) 


24.84 


27.97 


49.22 


62.83 


61.81 



ity matrices for SC, with different setting of PCA pre- 
processing for noise removal 8 . By comparison, obviously 
LRR-PSD and LRR win out and they are relatively robust 
against the PCA step, partially by virtue of their design to 
perform corruption removal together with affinity learning. 
To test the running time, we also include another set where 
each image in EYB is resized into 21 x 24 (Set 1). We denote 
the original setting Set 2, and use the first 20 classes of 
which each image resized into 42 x 48 to produce Set 3. We 
report the running time (T), number of iterations (Iter), con- 
vergence tolerance (Tol) for each setting. Table II presents 
the results. Interestingly, LRR-DSP always converges with 
more iterations but less running time than that of LRR. The 

Table II 

Running time and iterations on EYB. Advantage of LRR-PSD 

BECOMES SIGNIFICANT AS THE DATA SCALE GROWS UP. 



LRR/LRR-PSD 


T (sec) 


Iter 


Tol 


Set 1 


271.87/218.27 


178/330 


10" 


- b /10" 


D 


Set 2 


475.23/461.22 


193/496 


10" 


- 6 /10" 


-6 


Set 3 


3801.43/2735.48 


185/392 


10" 


- e /io- 


6 



benefit of using eigen-decomposition in place of SVD is 
apparent. 

V. Summary and Outlook 

In pursuit of providing more insights into recent line of 
research work on sparse-reconstruction based affinity matrix 
learning for subspace segmentation, we have discovered an 
important equivalence between the recently proposed LRR 
and our advocated version LRR-PSD in their canonical 
forms. This is a critical step towards understanding the 
behaviors of this family of algorithms. Moreover, we show 
that our advocated version, in its robust/denoising form, 
also facilitates a simple solution scheme that is as least as 
simple as the original optimization of LRR. Our experiments 
suggest in practice LRR-PSD is more likely to be flexible 
in solving large-scale problems. 

Our current work is far from conclusive. In fact, there are 
several significant problems remained to be solved. First of 
all we observed in experiments the robust versions most of 
the times also produce affinity matrices with only positive 
eigenvalues, and themselves are very close to symmetric. We 
have not figured out ways to formally explain or even prove 
this. Furthermore, similar to the RPCA problem, it is urgent 
to provide theoretic analysis of the operational conditions 
of such formulation. From the computational side, SVD or 
eigen-decomposition on large matrices would finally become 
prohibitive. It would be useful to figure out ways to speed up 
nuclear norm optimization problems for practical purposes. 
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Appendix 

A. Proof of Lemma 13 

Proof: Recall the fact that nuclear norm is dual to the 
spectral norm \\-\\2 (Lemma 3), the dual description follows 

Zl2 



sup 



Zn 
Z21 



Z12 
Z22 



A 
C 



Z n 
Z21 



Z22_ 

(33) 



and similarly we also have 
A 0" 



D 



8 For SSC we used the implementation provided by the authors of [14] 
with proper modification to their PCA routine. 



= sup 
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= IIAI 
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IIDII 



Z12 
Z22 

Z22 



A 
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Zn 




Z12 
Z22 

Z22 



= 1 



= 1 



(34) 



Since (34) is a supremum over a subset of that in (33), 
the inequality about the nuclear norm holds. The claim 
about the square of Frobenuis norm holds trivially from 
nonnegativeness of any block norm squares contributed to 
the total norm square. ■ 

B. Proof of Theorem 16 

Proof: Similarly the program is strictly convex and 
we expect a unique minimizer. By the semi-definiteness 
constraint, we are only interested in M 6 S n . Hence 
||M - P||| = ||M - P T \\ 2 F , which suggests the 
objective function can be cast in its equivalent form 
1/^||M||*+1/4||M-P||!+1/4||M-P T ||J,. Further we 
observe that 

||M-P||! + ||M-P T ||! = ||M-(P + P T )/2||2,+C(P) 

(35) 

where C(P) only depends on P (are hence constants) . 
Hence we reach an equivalent formation of the original 
program Eq. (31) as 

M* = argmin -||M||* + -||M-P|||, s.t. M >z 0, (36) 
MM 2 

with P = (P + P T )/2. Solution to Eq. (36) readily follows 
from Theorem 14. ■ 
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